Technologies for nucleotide sequence screening

ABSTRACT

Illustrative embodiments of technologies for nucleotide sequence screening are disclosed. In one illustrative embodiment, a system may include a server to communicate with a remote frontend over a network in order to receive a request to screen one or more nucleotide sequences for hazardous content and to report a result of the screening. The system may also include a compute engine to compare each nucleotide sequence to each of a plurality of reference sequences stored in a reference database, to detect whether hazardous content is present in each nucleotide sequence based upon the comparison of that nucleotide sequence to each of the plurality of reference sequences, and to assign one of a plurality of threat levels to each nucleotide sequence based upon the detection of whether hazardous content is present in that nucleotide sequence. The reported result may include the threat level assigned to each nucleotide sequence.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 16/705,191, filed Dec. 5, 2019, which claims the benefit of U.S. Provisional Patent Application No. 62/776,273, filed Dec. 6, 2018, the entire disclosures of which are incorporated herein by reference.

TECHNICAL FIELD

The present disclosure relates, generally, to nucleotide sequence screening and, more particularly, to technologies for screening nucleotide sequences for indications of hazardous content.

BACKGROUND

Gene synthesis, sometimes referred to as DNA printing, involves the creation of artificial genes “de novo,” without the need for preexisting, template DNA sequences. Gene synthesis approaches are most often based on a combination of organic chemistry and molecular biology techniques. Gene synthesis is an important tool in many fields of recombinant DNA technology, including heterologous gene expression, vaccine development, gene therapy and molecular engineering. The synthesis of nucleic acid sequences can be more economical than classical cloning and mutagenesis procedures. It is also a powerful and flexible engineering tool for creating and designing new DNA sequences and protein functions.

Commercial gene synthesis carries risk in that the synthesizing party typically receives the requested sequence from a third party client and is unaware of the contents of the sequence to be synthesized. Whether known or unknown to the client, it is possible that the requested sequence may contain hazardous content, such as genes relating to disease or biological weapons, by way of example. Lacking any technology for accurately and efficiently checking unknown sequences for hazardous content, however, the synthesizing party is unable to evaluate the risk of synthesizing the requested sequence up front. More generally, any party dealing with an unknown nucleotide sequence may benefit from an improved solution for screening the contents of that sequence (for hazardous content or otherwise).

SUMMARY

The following clauses, and combinations thereof, provide various illustrative aspects of the inventions described herein. The various illustrative embodiments described in any other section of this patent application, including the section titled “DETAILED DESCRIPTION OF THE DRAWINGS,” are applicable to any of the following embodiments of the inventions described in the numbered clauses below.

1. A system comprising a server to communicate with a remote frontend over a network to receive, from the frontend, a request to screen one or more nucleotide sequences for hazardous content and to report, to the frontend, a result of screening the one or more nucleotide sequences for hazardous content.

2. The system of clause 1, further comprising a compute engine to compare each nucleotide sequence of the one or more nucleotide sequences to each of a plurality of reference sequences stored in a reference database, to detect whether hazardous content is present in each nucleotide sequence based upon the comparison of that nucleotide sequence to each of the plurality of reference sequences, and to assign one of a plurality of threat levels to each nucleotide sequence based upon the detection of whether hazardous content is present in that nucleotide sequence.

3. The system of clause 2, wherein the result reported to the frontend by the server comprises the threat level assigned to each nucleotide sequence of the one or more nucleotide sequences.

4. The system of any one of clauses 2 and 3, wherein the plurality of threat levels includes at least a first level representing a threat, a second level representing a potential threat, a third level representing an unlikely threat, and a fourth level representing a non-threat.

5. The system of any one of clauses 2-4, wherein to compare each nucleotide sequence of the one or more nucleotide sequences to each of the plurality of reference sequences comprises using a basic local alignment search tool (BLAST) to compare each nucleotide sequence to each of the plurality of reference sequences.

6. The system of any one of clauses 2-5, wherein each of the plurality of reference sequences includes hazardous content.

7. The system of clause 6, wherein the reference database further comprises metadata associated with each reference sequence that describes one or more characteristics of the hazardous content included in that reference sequence.

8. The system of clause 7, wherein the compute engine is further to retrieve the corresponding metadata from the reference database in response to detecting that hazardous content is present in one of the nucleotide sequences.

9. The system of clause 8, wherein the result reported to the frontend by the server comprises the corresponding metadata for each nucleotide sequence for which hazardous content is detected.

10. The system of any one of clauses 2-9, wherein to detect whether hazardous content is present in each nucleotide sequence based upon the comparison of that nucleotide sequence to each of the plurality of reference sequences comprises to select a reference sequence that provided a closest match to that nucleotide sequence during the comparison of that nucleotide sequence to each of the plurality of reference sequences, wherein the selected reference sequence includes hazardous content.

11. The system of clause 10, wherein to detect whether hazardous content is present in each nucleotide sequence based upon the comparison of that nucleotide sequence to each of the plurality of reference sequences further comprises to detect that hazardous content is present in that nucleotide sequence in response to determining that (i) a matching length between the selected reference sequence and that nucleotide sequence exceeds a threshold length and (ii) a matching percentage between the selected reference sequence and that nucleotide sequence exceeds a threshold percentage.

12. The system of any one of clauses 2-11, wherein to detect whether hazardous content is present in each nucleotide sequence based upon the comparison of that nucleotide sequence to each of the plurality of reference sequences comprises to, for each reference sequence including hazardous content where (i) a matching length between the reference sequence and that nucleotide sequence does not exceed a threshold length but (ii) a matching percentage between the reference sequence and that nucleotide sequence does exceed a threshold percentage, extend the matching length up to the threshold length.

13. The system of clause 12, wherein to detect whether hazardous content is present in each nucleotide sequence based upon the comparison of that nucleotide sequence to each of the plurality of reference sequences further comprises to detect that hazardous content is present in that nucleotide sequence in response to determining that the matching percentage between the extended reference sequence and that nucleotide sequence still exceeds the threshold percentage.

14. The system of any one of clauses 2-13, wherein to detect whether hazardous content is present in each nucleotide sequence based upon the comparison of that nucleotide sequence to each of the plurality of reference sequences comprises to, for each reference sequence including hazardous content where (i) a matching length between the reference sequence and that nucleotide sequence does exceed a threshold length but (ii) a matching percentage between the reference sequence and that nucleotide sequence does not exceed a threshold percentage, apply a sliding window to analyze a matching percentage between that nucleotide sequence and each portion of the reference sequence having the threshold length.

15. The system of clause 14, wherein to detect whether hazardous content is present in each nucleotide sequence based upon the comparison of that nucleotide sequence to each of the plurality of reference sequences further comprises to detect that hazardous content is present in that nucleotide sequence in response to determining that the matching percentage between that nucleotide sequence and any portion of the reference sequence having the threshold length exceeds the threshold percentage.

16. The system of any one of clauses 2-15, wherein to detect whether hazardous content is present in each nucleotide sequence based upon the comparison of that nucleotide sequence to each of the plurality of reference sequences comprises to select a plurality of nucleotide sequence segments that each matched part of one of the plurality of reference sequences including hazardous content, where a matching length between each selected nucleotide sequence segment and the corresponding partial reference sequence does not exceed a threshold length.

17. The system of clause 16, wherein to detect whether hazardous content is present in each nucleotide sequence based upon the comparison of that nucleotide sequence to each of the plurality of reference sequences further comprises to combine the selected plurality of nucleotide sequence segments into a composite nucleotide sequence.

18. The system of clause 17, wherein to detect whether hazardous content is present in each nucleotide sequence based upon the comparison of that nucleotide sequence to each of the plurality of reference sequences further comprises to apply a sliding window to the composite nucleotide sequence to analyze a matching percentage between each portion of the composite nucleotide sequence having the threshold length and the one of the plurality of reference sequences.

19. The system of clause 18, wherein to detect whether hazardous content is present in each nucleotide sequence based upon the comparison of that nucleotide sequence to each of the plurality of reference sequences further comprises to detect that hazardous content is present in the composite nucleotide sequence in response to determining that the matching percentage between any portion of the composite nucleotide sequence having the threshold length and the one of the plurality of reference sequences exceeds a threshold percentage.

20. The system of any one of clauses 1-19, wherein the frontend is to provide a graphical user interface to allow a user to input the one or more nucleotide sequences to be screened for hazardous content and to display to the user the result of screening the one or more nucleotide sequences for hazardous content.

21. The system of clause 20, wherein the graphical user interface is configured to allow the user to input a plurality of nucleotide sequences to be screened by uploading a single file containing the plurality of nucleotide sequences.

22. The system of any one of clauses 20 and 21, wherein the graphical user interface is configured to display to the user a progress of the screening of the one or more nucleotide sequences for hazardous content, based upon asynchronous updates received from the server, until the result is received from the server.

23. The system of any one of clauses 1-22, further comprising a workflow database including a queue of nucleotide sequences to be screened for hazardous content, wherein the server is further to write each of the one or more nucleotide sequences received from the frontend to the queue, and wherein the compute engine is further to read one nucleotide sequence at a time from the queue in order to compare that nucleotide sequence to each of the plurality of reference sequences.

24. A method comprising receiving, with a server from a remote frontend over a network, a request to screen one or more nucleotide sequences for hazardous content.

25. The method of clause 24, further comprising comparing, with a compute engine, each nucleotide sequence of the one or more nucleotide sequences to each of a plurality of reference sequences stored in a reference database.

26. The method of clause 25, further comprising detecting, with the compute engine, whether hazardous content is present in each nucleotide sequence based upon the comparison of that nucleotide sequence to each of the plurality of reference sequences.

27. The method of clause 26, further comprising assigning, with the compute engine, one of a plurality of threat levels to each nucleotide sequence based upon the detection of whether hazardous content is present in that nucleotide sequence.

28. The method of clause 27, further comprising reporting, from the server to the frontend over the network, the threat level assigned to each nucleotide sequence of the one or more nucleotide sequences.

29. The method of any one of clauses 25-28, wherein comparing each nucleotide sequence of the one or more nucleotide sequences to each of the plurality of reference sequences comprises using a basic local alignment search tool (BLAST) to compare each nucleotide sequence to each of the plurality of reference sequences.

30. The method of any one of clauses 25-29, wherein each of the plurality of reference sequences includes hazardous content.

31. The method of clause 30, wherein the reference database further comprises metadata associated with each reference sequence that describes one or more characteristics of the hazardous content included in that reference sequence.

32. The method of clause 31, further comprising retrieving, with the compute engine, the corresponding metadata from the reference database in response to detecting that hazardous content is present in one of the nucleotide sequences.

33. The method of clause 32, further comprising reporting, from the server to the frontend over the network, the corresponding metadata for each nucleotide sequence for which hazardous content is detected.

34. The method of any one of clauses 26-33, wherein detecting whether hazardous content is present in each nucleotide sequence based upon the comparison of that nucleotide sequence to each of the plurality of reference sequences comprises selecting a reference sequence that provided a closest match to that nucleotide sequence during the comparison of that nucleotide sequence to each of the plurality of reference sequences, wherein the selected reference sequence includes hazardous content.

35. The method of clause 34, wherein detecting whether hazardous content is present in each nucleotide sequence based upon the comparison of that nucleotide sequence to each of the plurality of reference sequences further comprises detecting that hazardous content is present in that nucleotide sequence in response to determining that (i) a matching length between the selected reference sequence and that nucleotide sequence exceeds a threshold length and (ii) a matching percentage between the selected reference sequence and that nucleotide sequence exceeds a threshold percentage.

36. The method of any one of clauses 26-35, wherein detecting whether hazardous content is present in each nucleotide sequence based upon the comparison of that nucleotide sequence to each of the plurality of reference sequences comprises, for each reference sequence including hazardous content where (i) a matching length between the reference sequence and that nucleotide sequence does not exceed a threshold length but (ii) a matching percentage between the reference sequence and that nucleotide sequence does exceed a threshold percentage, extending the matching length up to the threshold length.

37. The method of clause 36, wherein detecting whether hazardous content is present in each nucleotide sequence based upon the comparison of that nucleotide sequence to each of the plurality of reference sequences further comprises detecting that hazardous content is present in that nucleotide sequence in response to determining that the matching percentage between the extended reference sequence and that nucleotide sequence still exceeds the threshold percentage.

38. The method of any one of clauses 26-37, wherein detecting whether hazardous content is present in each nucleotide sequence based upon the comparison of that nucleotide sequence to each of the plurality of reference sequences comprises, for each reference sequence including hazardous content where (i) a matching length between the reference sequence and that nucleotide sequence does exceed a threshold length but (ii) a matching percentage between the reference sequence and that nucleotide sequence does not exceed a threshold percentage, applying a sliding window to analyze a matching percentage between that nucleotide sequence and each portion of the reference sequence having the threshold length.

39. The method of clause 38, wherein detecting whether hazardous content is present in each nucleotide sequence based upon the comparison of that nucleotide sequence to each of the plurality of reference sequences further comprises detecting that hazardous content is present in that nucleotide sequence in response to determining that the matching percentage between that nucleotide sequence and any portion of the reference sequence having the threshold length exceeds the threshold percentage.

40. The method of any one of clauses 26-39, wherein detecting whether hazardous content is present in each nucleotide sequence based upon the comparison of that nucleotide sequence to each of the plurality of reference sequences comprises selecting a plurality of nucleotide sequence segments that each matched part of one of the plurality of reference sequences including hazardous content, where a matching length between each selected nucleotide sequence segment and the corresponding partial reference sequence does not exceed a threshold length.

41. The method of clause 40, further comprising combining the selected plurality of nucleotide sequence segments into a composite nucleotide sequence.

42. The method of clause 41, further comprising applying a sliding window to the composite nucleotide sequence to analyze a matching percentage between each portion of the composite nucleotide sequence having the threshold length and the one of the plurality of reference sequences.

43. The method of clause 42, further comprising detecting that hazardous content is present in the composite nucleotide sequence in response to determining that the matching percentage between any portion of the composite nucleotide sequence having the threshold length and the one of the plurality of reference sequences exceeds a threshold percentage.

44. The method of any one of clauses 24-43, further comprising providing a graphical user interface to allow a user to input the one or more nucleotide sequences to be screened for hazardous content and to display to the user the threat level assigned to each nucleotide sequence of the one or more nucleotide sequences.

45. The method of clause 44, wherein the graphical user interface is configured to allow the user to input a plurality of nucleotide sequences to be screened by uploading a single file containing the plurality of nucleotide sequences.

46. The method of clause 44 or clause 45, wherein the graphical user interface is configured to display to the user a progress of the screening of the one or more nucleotide sequences for hazardous content, based upon asynchronous updates received from the server, until the result is received from the server.

47. The method of any one of clauses 24-46, further comprising writing, with the server, each of the one or more nucleotide sequences received from the remote frontend to a queue of nucleotide sequences to be screened for hazardous content, the queue being stored in a workflow database.

48. The method of clause 47, further comprising reading, with the compute engine, one nucleotide sequence at a time from the queue before comparing that nucleotide sequence to each of the plurality of reference sequences.

49. The method of any one of clauses 27-48, wherein the plurality of threat levels includes at least a first level representing a threat, a second level representing a potential threat, a third level representing an unlikely threat, and a fourth level representing a non-threat.
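By way of a non-limiting sketch, the threshold test and the sliding-window analysis described in the clauses above might look as follows in Python. The sketch assumes that the nucleotide sequence and the reference sequence have already been aligned into gap-free strings of equal length (for instance, from the output of an alignment tool); the function names and the example thresholds are hypothetical and are not taken from any working embodiment.

```python
def percent_identity(a: str, b: str) -> float:
    """Return the percentage of positions at which two equal-length strings agree."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def passes_thresholds(match_length: int, match_percent: float,
                      threshold_length: int, threshold_percent: float) -> bool:
    """Direct test: both the matching length and the matching percentage exceed their thresholds."""
    return match_length > threshold_length and match_percent > threshold_percent


def sliding_window_hit(query: str, reference: str,
                       threshold_length: int, threshold_percent: float) -> bool:
    """Return True if any window of threshold_length exceeds threshold_percent.

    Assumption: query and reference are pre-aligned, gap-free, and equal in length.
    """
    if len(query) != len(reference):
        raise ValueError("aligned sequences must be of equal length")
    if len(query) < threshold_length:
        return False
    for start in range(len(query) - threshold_length + 1):
        end = start + threshold_length
        if percent_identity(query[start:end], reference[start:end]) > threshold_percent:
            return True
    return False
```

In this sketch, a match that is long enough but too dissimilar overall fails `passes_thresholds`, yet can still be flagged by `sliding_window_hit` when some portion having the threshold length matches closely.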

BRIEF DESCRIPTION OF THE DRAWINGS

The concepts described in the present disclosure are illustrated by way of example and not by way of limitation in the accompanying figures. For simplicity and clarity of illustration, elements illustrated in the figures are not necessarily drawn to scale. For example, the dimensions of some elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference labels have been repeated among the figures to indicate corresponding or analogous elements. The detailed description particularly refers to the accompanying figures in which:

FIG. 1 is a simplified block diagram illustrating one embodiment of a system for nucleotide sequence screening;

FIG. 2 is a simplified flow diagram illustrating one embodiment of a method of nucleotide sequence screening that may be performed by the system of FIG. 1; and

FIG. 3 is a simplified diagram illustrating one embodiment of a graphical user interface that may be provided by a frontend of the system of FIG. 1.

DETAILED DESCRIPTION OF THE DRAWINGS

While the concepts of the present disclosure are susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that there is no intent to limit the concepts of the present disclosure to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives consistent with the present disclosure and the appended claims.

References in the specification to “one embodiment,” “an embodiment,” “an example embodiment,” etcetera, indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.

Embodiments of the concepts described herein may be implemented in hardware, firmware, software, or any combination thereof. For instance, embodiments of the concepts described herein may be implemented as data and/or instructions carried by or stored on one or more machine-readable or computer-readable storage media, which may be read and/or executed by one or more processors. A machine-readable or computer-readable storage medium may be embodied as any device, mechanism, or physical structure for storing or transmitting information in a form readable by a machine (e.g., a computing device or system). For example, a machine-readable or computer-readable storage medium may be embodied as read only memory (ROM) device(s); random access memory (RAM) device(s); magnetic disk storage media; optical storage media; flash memory devices; mini- or micro-SD cards, memory sticks, and others.

In the drawings, specific arrangements or orderings of schematic elements, such as those representing devices, modules, software, and data elements, may be shown for ease of description. However, it should be understood by those skilled in the art that the specific ordering or arrangement of the schematic elements in the drawings is not meant to imply that a particular order or sequence of processing, or separation of processes, is required. Further, the inclusion of a schematic element in a drawing is not meant to imply that such element is required in all embodiments or that the features represented by such element may not be included in or combined with other elements in some embodiments.

In general, schematic elements used to represent software may be implemented using any suitable form of machine-readable instruction, such as software or firmware applications, programs, functions, modules, routines, processes, procedures, plug-ins, applets, widgets, code fragments and/or others, and that each such instruction may be implemented using any suitable programming language, library, application programming interface (API), and/or other software development tools. For example, some embodiments may be implemented using Java, C++, and/or other programming languages. Similarly, schematic elements used to represent data or information may be implemented using any suitable electronic arrangement or structure, such as a register, data store, table, record, array, index, hash, map, tree, list, graph, file (of any file type), folder, directory, database, and/or others.

Further, in the drawings, where connecting elements, such as solid or dashed lines or arrows, are used to illustrate a connection, relationship or association between or among two or more other schematic elements, the absence of any such connecting elements is not meant to imply that no connection, relationship or association can exist. In other words, some connections, relationships or associations between elements may not be shown in the drawings so as not to obscure the disclosure. In addition, for ease of illustration, a single connecting element may be used to represent multiple connections, relationships or associations between elements. For example, where a connecting element represents a communication of signals, data, instructions, or other information, it should be understood by those skilled in the art that such element may represent one or multiple signal paths, as may be needed, to effect the communication.

Referring now to FIG. 1, one illustrative embodiment of a system 100 for nucleotide sequence screening is shown as a simplified block diagram. In this embodiment, the system 100 comprises a frontend 102, a server 104, a compute engine 106, a workflow database 108, and one or more reference databases 110. It is contemplated that the components of system 100 (including any of the frontend 102, the server 104, the compute engine 106, the workflow database 108, and the reference database(s) 110) may each be embodied in hardware, software, firmware, or any combination thereof. It will also be appreciated that, in some embodiments, the system 100 may include additional and/or different components than those shown in FIG. 1.

The various components of the system 100 are communicatively coupled via one or more wired and/or wireless networks (as illustrated by the arrows in FIG. 1). For example, the frontend 102, the server 104, the compute engine 106, the workflow database 108, and the reference database(s) 110 may each be communicatively coupled to some or all of the other components of the system 100 via a wired or wireless local area network (LAN), a wired or wireless wide area network (WAN), a cellular network, and/or a publicly-accessible, global network, such as the Internet. As such, the system 100 may include any number of additional components, such as additional computers, routers, and switches, to facilitate communications among the components of the system 100. Due to these network connections, it is not necessary for any of the components of system 100 to be physically located together. In the illustrative embodiment, the frontend 102 is located remotely from at least the server 104 (e.g., in another building, city, state, or country), and the frontend 102 and server 104 are communicatively coupled via the Internet.

In the illustrative embodiment, the frontend 102 provides a web interface through which a user of the system 100 can input one or more nucleotide sequences (e.g., DNA sequences) to be screened, can monitor progress of the screening of inputted sequences, and can view the results of completed screenings. The frontend 102 may be embodied as any type of computation or computer device capable of performing the functions described herein, including, without limitation, a computer, a multiprocessor system, a server, a rack-mounted server, a blade server, a programmable logic controller, an embedded controller, an embedded system, a processor-based system, and/or a consumer electronic device. The frontend 102 may alternatively be embodied as software and/or firmware configured to execute on any of the foregoing devices to perform the functions described herein. As suggested in FIG. 1, the system 100 may comprise numerous instances of the frontend 102 (e.g., associated with numerous users of the system 100).

In the illustrative embodiment, the server 104 receives requests for nucleotide sequence screening from the frontend 102, initiates new jobs based on such requests, tracks the progress of these jobs as they run, and reports results of completed screenings to the frontend 102. The server 104 may be embodied as any type of computation or computer device capable of performing the functions described herein, including, without limitation, a computer, a multiprocessor system, a server, a rack-mounted server, a blade server, a programmable logic controller, an embedded controller, an embedded system, a processor-based system, and/or a consumer electronic device. For example, the server 104 may be embodied as a web server accessible over a public network (e.g., a cloud server). Additionally or alternatively, the server 104 may be embodied as a local gateway device accessible over a local area network or other network. Additionally, in some embodiments, the server 104 may be embodied as a “virtual server” formed from multiple computing devices distributed across one or more networks and operating in a public or private cloud. Accordingly, although the server 104 is illustrated in FIG. 1 as embodied as a single server computing device, it should be appreciated that the server 104 may be embodied as multiple devices cooperating together to facilitate the functionality described below. In some embodiments (not shown), the server 104 and the compute engine 106 may both be embodied in the same physical server device or collection of devices.

In the illustrative embodiment, the compute engine 106 is embodied as an application for running comparisons between the nucleotide sequences being screened and reference sequences (stored in reference database(s) 110), analyzing the results of these comparisons to detect hazardous contents in the nucleotide sequences being screened, and updating the server 104 with its progress on these tasks. As suggested in FIG. 1, the system 100 may comprise numerous instances of the compute engine 106 (e.g., running on a cloud). The compute engine 106 may run on (or, alternatively, be embodied as) any type of computation or computer device capable of performing the functions described herein, including, without limitation, a computer, a multiprocessor system, a server, a rack-mounted server, a blade server, a programmable logic controller, an embedded controller, an embedded system, a processor-based system, and/or a consumer electronic device. For example, the compute engine 106 may be run on (or embodied as) a web server accessible over a public network (e.g., a cloud server). Additionally or alternatively, the compute engine 106 may run on (or be embodied as) a local gateway device accessible over a local area network or other network. Additionally, in some embodiments, the compute engine 106 may run on (or be embodied as) a “virtual server” formed from multiple computing devices distributed across one or more networks and operating in a public or private cloud. As noted above, in some embodiments (not shown), the server 104 and the compute engine 106 may both be embodied in (or run on) the same physical server device or collection of devices.

In the illustrative embodiment, the workflow database 108 is embodied as a SQLite3 database for tracking jobs initiated by the server 104 and performed by the compute engine 106 and the results of those jobs once completed. The workflow database 108 may alternatively be embodied as any number of data structures stored on any type of computer-readable media. For instance, in some embodiments, the workflow database 108 may be combined with the server 104 and/or the compute engine 106.

The workflow database 108 illustratively maintains two tables of data. In a first data table, the workflow database 108 includes a queue of nucleotide sequences to be screened for hazardous content. Each entry in this table reflects a single nucleotide sequence to be run against a specific reference database 110 (by the compute engine 106). Each entry in the first data table of the workflow database 108 may include timestamps of when the run is started and when it is finished. This table may also group multiple runs (entries) together through the use of a “job identification” (job_id) field, without needing to create a separate table. Each entry in the first data table may also include an options field that contains a JavaScript Object Notation (JSON) encoded dictionary of options that may be passed to the compute engine 106 to influence how the run is handled. The workflow database 108 may also include a second data table to hold the results of the analyses performed by the compute engine 106. Additionally, each entry in this second table can specify the analysis performed by the compute engine 106 via “name” and “method” fields, by way of example.
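By way of a non-limiting sketch, the two data tables described above might be laid out as follows using Python's built-in sqlite3 module. The table names (runs, analyses), the column names other than job_id, options, name, and method, and the helper functions open_workflow_db and enqueue_run are assumptions made for illustration only; the disclosure does not specify them.

```python
import json
import sqlite3

# Hypothetical schema: only job_id, options, name and method are named in the
# description; every other identifier here is an illustrative assumption.
SCHEMA = """
CREATE TABLE IF NOT EXISTS runs (
    id           INTEGER PRIMARY KEY,
    job_id       TEXT NOT NULL,               -- groups several runs into one job
    sequence     TEXT NOT NULL,               -- nucleotide sequence to be screened
    reference_db TEXT NOT NULL,               -- reference database to run against
    options      TEXT NOT NULL DEFAULT '{}',  -- JSON-encoded dictionary of options
    started_at   TEXT,                        -- set when the run is started
    finished_at  TEXT                         -- set when the run is finished
);
CREATE TABLE IF NOT EXISTS analyses (
    id     INTEGER PRIMARY KEY,
    run_id INTEGER NOT NULL REFERENCES runs(id),
    name   TEXT NOT NULL,                     -- which analysis was requested
    method TEXT NOT NULL,                     -- how that analysis is performed
    result TEXT                               -- NULL until the analysis completes
);
"""


def open_workflow_db(path=":memory:"):
    """Open (or create) the workflow database and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def enqueue_run(conn, job_id, sequence, reference_db, options=None):
    """Add one sequence/reference-database pair to the queue; return its row id."""
    cur = conn.execute(
        "INSERT INTO runs (job_id, sequence, reference_db, options) "
        "VALUES (?, ?, ?, ?)",
        (job_id, sequence, reference_db, json.dumps(options or {})),
    )
    conn.commit()
    return cur.lastrowid
```

In this sketch, several calls to enqueue_run with the same job_id value group the resulting runs into one job without a separate jobs table, mirroring the grouping described above.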

The frontend 102 is operable to provide a graphical user interface (GUI) 300, one simplified example of which is shown in FIG. 3. One working embodiment of this frontend 102 was built as a Single Page Web Application (SPA) using React, a JavaScript library for creating interactive user interfaces. As shown in FIG. 3, the GUI 300 includes an input box 302 that allows a user to input one or more nucleotide sequences to be screened for hazardous content. In this embodiment, the input box 302 allows a user to type sequences or copy-and-paste sequences to be screened. Additionally, the input box 302 allows the user to drag-and-drop text files containing nucleotide sequences (e.g., FASTA files) to upload them to the system 100 for screening. In this way, the user can easily input a large group of nucleotide sequences to be screened by uploading a single file.

The GUI 300 also includes a button 304 (labelled “submit” in FIG. 3) to complete the input process and begin the screening of any inputted sequences (whether manually typed into input box 302 or uploaded via a file). When a user clicks the “submit” button 304, a POST request is sent to the server 104 to start the job. Pressing the button 304 also initiates the progress/results window 306 of the GUI 300.

The window 306 is utilized by the GUI 300 to display progress and/or results on each job submitted by a user of the frontend 102 of the system 100. When the window 306 is initiated (after button 304 is clicked), the frontend 102 opens a websocket connection to the server 104. As the frontend 102 receives asynchronous updates about the progress of the job from the server 104, the GUI 300 of the frontend 102 displays the progress to the user in window 306. For instance, in FIG. 3, the window 306 is displaying the status of the jobs entitled “Sequence5” and “Sequence6” as “running” and “waiting,” respectively. Once the job is complete, the GUI 300 of frontend 102 displays the result of each screening to the user in window 306. As illustrated in FIG. 3, this result may take the form of a threat level assigned to each nucleotide sequence by the compute engine 106. In this illustrative embodiment, the threat level may take one of four values: “threat,” “potential threat,” “unlikely threat,” and “non-threat.” It will be appreciated that other threat levels (including different numbers of threat levels) might be used in other embodiments. In the illustrative embodiment, the portion of window 306 including the assigned threat level is also color coded to highlight the result (e.g., with greater threat levels being presented on red backgrounds of differing intensity and lesser threat levels being presented on green backgrounds of differing intensity).

The GUI 300 of the frontend 102 may also include a modal (not shown) for presenting additional results of the screening. This modal may become visible when a user clicks on or hovers the mouse over a particular portion of the GUI 300, such as the assigned threat levels in window 306. For instance, as described in more detail below, the reference database(s) 110 may include numerous types of metadata associated with each reference sequence, where the metadata describes one or more characteristics of the hazardous content included in that reference sequence. By way of example, the metadata might include a description of a Virulence Factor (VF) for the hazardous content, the VF's function, and the like. In such embodiments, this metadata may be retrieved from the reference database(s) 110 for nucleotide sequences hitting on these results and provided to the frontend 102 via the server 104. When a user of the GUI 300 clicks on one of the assigned threat levels in window 306 that indicates a threat, the modal may become visible (“pop-up”) to display some or all of the metadata retrieved from the reference database(s) 110 relating to that nucleotide sequence and its hazardous content.

In one embodiment, the server 104 was implemented as a Python web server capable of handling Hypertext Transfer Protocol (HTTP) requests from the frontend 102 and the compute engine 106. In this embodiment, the server 104 implemented various endpoints, including an initiation endpoint, a notification endpoint, and a progress-monitoring endpoint. The initiation endpoint operates to receive one or more nucleotide sequences from the frontend 102 to be screened. This endpoint writes each sequence received to the queue (the first data table) maintained by the workflow database 108. The server 104 also specifies the analyses to be performed, including the reference database(s) 110 to be used, by the compute engine 106. This information is all stored in the workflow database 108 together with null fields to be filled in with results from the compute engine 106 upon the completion of each run.

The notification endpoint is used by the compute engine 106 to notify the server 104 whenever there is an update on the progress of a particular job. If there are any open websockets listening for the job ID associated with an update provided by the compute engine 106, the server 104 sends an update to the associated frontend 102. The websockets between the frontend 102 and the server 104 utilize the progress-monitoring endpoint. This endpoint adds the job ID associated with each open websocket to a list of IDs to monitor until the connection is no longer open. Whenever updates are triggered for a job with this ID, the server 104 will poll the workflow database 108 about all the runs and analyses associated with that ID. The server 104 compiles this information and sends it to the frontend 102 as an asynchronous update.
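The interplay between the notification endpoint and the progress-monitoring endpoint can be sketched, under stated assumptions, as a small in-memory registry of listeners keyed by job ID. The class name ProgressRegistry, its method names, and the use of plain callables in place of real websocket connections are all hypothetical and are not taken from the disclosure.

```python
from collections import defaultdict


class ProgressRegistry:
    """Tracks which open connections are listening for which job ID.

    Each listener is modelled as a callable taking one status payload; in a
    real server it would wrap a websocket send (an assumption of this sketch).
    """

    def __init__(self):
        self._listeners = defaultdict(set)

    def subscribe(self, job_id, send):
        """Progress-monitoring side: start monitoring job_id for this listener."""
        self._listeners[job_id].add(send)

    def unsubscribe(self, job_id, send):
        """Stop monitoring once the connection is no longer open."""
        self._listeners[job_id].discard(send)
        if not self._listeners[job_id]:
            del self._listeners[job_id]

    def notify(self, job_id, load_status):
        """Notification side: push the compiled job status to every listener.

        load_status(job_id) stands in for querying the workflow database for
        all runs and analyses of the job. Returns the number of listeners
        updated; the database is not queried when nobody is listening.
        """
        listeners = self._listeners.get(job_id)
        if not listeners:
            return 0
        status = load_status(job_id)
        for send in list(listeners):
            send(status)
        return len(listeners)
```

Keeping the registry separate from the database query means an update for a job with no open connections costs only a dictionary lookup.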

The compute engine 106 periodically polls the workflow database 108 for new runs and, when found, executes them. To do so, the compute engine 106 first compares the nucleotide sequence to a number of reference sequences stored in one of the reference databases 110. For instance, the compute engine 106 may access a reference database 110 that contains reference sequences known to include hazardous content (e.g., DNA sequences associated with disease, biological weapons, and the like). The compute engine 106 may compare the nucleotide sequence being run with each of the reference sequences using a basic local alignment search tool (BLAST) that identifies matching sequences within certain constraints, such as a certain percentage matching (a certain threshold of matching nucleotide bases) over a certain sequence length. The compute engine 106 may use any of the many known BLAST algorithms to perform this comparison. The workflow database 108 may specify which BLAST algorithm and/or reference database 110 are to be used for a particular run, or it may provide the compute engine 106 with data used to select an appropriate BLAST algorithm and/or appropriate reference database 110 to be used for a particular run. The workflow database 108 may also specify certain options (e.g., tolerances) to be used when running the BLAST algorithm to perform the comparisons.
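The polling behavior described above might look roughly like the following sketch. It assumes a queue table named runs with id, sequence, reference_db, started_at, and finished_at columns, and a caller-supplied execute_run callable that performs the actual comparison; these names are illustrative assumptions, and the comparison itself (e.g., a BLAST invocation) is deliberately left out.

```python
import sqlite3
from datetime import datetime, timezone


def _now():
    """Current UTC time as an ISO 8601 string, used for run timestamps."""
    return datetime.now(timezone.utc).isoformat()


def claim_next_run(conn):
    """Return (id, sequence, reference_db) for the oldest unstarted run, or None.

    The run is marked as started by writing its start timestamp.
    """
    row = conn.execute(
        "SELECT id, sequence, reference_db FROM runs "
        "WHERE started_at IS NULL ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute("UPDATE runs SET started_at = ? WHERE id = ?", (_now(), row[0]))
    conn.commit()
    return row


def poll_once(conn, execute_run):
    """Execute at most one queued run; return True if a run was found."""
    run = claim_next_run(conn)
    if run is None:
        return False
    run_id, sequence, reference_db = run
    execute_run(sequence, reference_db)  # hypothetical comparison step
    conn.execute("UPDATE runs SET finished_at = ? WHERE id = ?", (_now(), run_id))
    conn.commit()
    return True
```

A compute engine could call poll_once in a loop with a sleep between empty polls. With several engine instances polling the same database, the select-then-update claim shown here would need to be made atomic, which this sketch does not attempt.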

After comparing the nucleotide sequence being run to the reference sequences of one of the reference databases 110, the compute engine 106 analyzes the results of the run to detect whether hazardous content is present in the nucleotide sequence and to assign one of a number of possible threat levels to the nucleotide sequence based upon this analysis. The compute engine 106 also retrieves corresponding metadata associated with detected hazardous content from the reference database 110 and performs any necessary post-processing before returning the results to the workflow database 108 (e.g., aggregating match results into a historical database, or queueing a run against another database). The hazardous content detection algorithm (as well as the metadata retrieval and post-processing functions) used by the compute engine 106 can be specified by the “name” and “method” fields associated with each run in the workflow database 108.

In some embodiments, the compute engine 106 detects whether hazardous content is present in the nucleotide sequence by selecting a reference sequence that provided the closest match to the nucleotide sequence during the comparison of the nucleotide sequence to each of the plurality of reference sequences. The compute engine 106 then determines whether a matching length between the selected “closest match” reference sequence and the nucleotide sequence exceeds a threshold length and whether a matching percentage between the selected “closest match” reference sequence and the nucleotide sequence exceeds a threshold percentage. If the minimum length and percentage matching criteria are met, the compute engine 106 considers the nucleotide sequence to be a positive match to that reference sequence and flags the nucleotide sequence as containing whatever hazardous content is represented by the reference sequence.
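A minimal sketch of this closest-match test follows. The Hit record, its field names, and the choice to rank hits by a numeric score are assumptions for illustration; the disclosure does not say how the closest match is ranked. The thresholds are treated as minimums that must be met (greater than or equal), which is one reading of the “minimum length and percentage matching criteria” above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One comparison result between the screened sequence and a reference."""
    reference_id: str
    score: float             # hypothetical ranking value (higher is closer)
    match_length: int        # number of aligned positions
    percent_identity: float  # 0-100


def closest_hit(hits):
    """Pick the hit ranked closest; returns None when there are no hits."""
    return max(hits, key=lambda h: h.score, default=None)


def positive_match(hits, threshold_length, threshold_percent):
    """Return the closest hit if it meets both thresholds, otherwise None."""
    best = closest_hit(hits)
    if best is None:
        return None
    if (best.match_length >= threshold_length
            and best.percent_identity >= threshold_percent):
        return best
    return None
```

Note that only the single closest hit is tested here; a lower-ranked hit that would pass both thresholds is ignored by this particular approach.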

In other embodiments, the compute engine 106 detects whether hazardous content is present in the nucleotide sequence by identifying each reference sequence where a matching length between the reference sequence and the nucleotide sequence does not exceed a threshold length but a matching percentage between the reference sequence and the nucleotide sequence does exceed a threshold percentage. For such a case, the compute engine 106 will extend (or scale) the matching length up to the threshold length to determine whether the matching percentage between the extended reference sequence and the nucleotide sequence still exceeds the threshold percentage. If so, the compute engine 106 will consider the nucleotide sequence to be a positive match to that reference sequence and will flag the nucleotide sequence as containing whatever hazardous content is represented by the reference sequence.
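One possible reading of this extension step is sketched below: the short match is padded out to the threshold length and every padded position is counted as a mismatch, so the matching percentage is recomputed over the threshold length. This interpretation, along with the function and parameter names, is an assumption; the disclosure does not spell out how the extended percentage is computed.

```python
def extended_percent(identical_bases, threshold_length):
    """Matching percentage after scaling the match up to threshold_length.

    Assumption of this sketch: positions added by the extension count as
    mismatches, so only the originally identical bases contribute.
    """
    if threshold_length <= 0:
        raise ValueError("threshold_length must be positive")
    return 100.0 * identical_bases / threshold_length


def passes_after_extension(identical_bases, match_length,
                           threshold_length, threshold_percent):
    """Apply the extension test to a short but high-identity match.

    Returns False when the match is not in the short/high-identity case
    this approach is meant for.
    """
    if match_length <= 0 or match_length >= threshold_length:
        return False  # not a short match
    original_percent = 100.0 * identical_bases / match_length
    if original_percent < threshold_percent:
        return False  # identity was already too low before extending
    return extended_percent(identical_bases, threshold_length) >= threshold_percent
```

Under this reading, a perfect 48-base match against a 50-base threshold still scores 96 percent after extension, whereas a perfect 40-base match drops to 80 percent.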

In some embodiments, the compute engine 106 detects whether hazardous content is present in the nucleotide sequence by identifying each reference sequence where a matching length between the reference sequence and the nucleotide sequence does exceed a threshold length but a matching percentage between the reference sequence and the nucleotide sequence does not exceed a threshold percentage. For such a case, the compute engine 106 will apply a sliding window to analyze a matching percentage between the nucleotide sequence and each portion of the reference sequence having the threshold length. If the compute engine 106 determines that the matching percentage between the nucleotide sequence and any portion of the reference sequence having the threshold length exceeds the threshold percentage, it will consider the nucleotide sequence to be a positive match to that reference sequence and will flag the nucleotide sequence as containing whatever hazardous content is represented by the reference sequence.
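The sliding-window analysis can be illustrated with the sketch below, which assumes the comparison step yields two aligned strings of equal length (with "-" marking gaps). That input format and the function name best_window_identity are assumptions for illustration. The window count is updated incrementally so the whole alignment is scanned in a single pass.

```python
def best_window_identity(query_aln, ref_aln, window):
    """Highest matching percentage over any window of `window` aligned positions.

    query_aln and ref_aln are aligned strings of equal length; a position
    counts as a match when both characters are equal and not a gap ("-").
    Returns None if the alignment is shorter than the window.
    """
    if len(query_aln) != len(ref_aln):
        raise ValueError("aligned strings must have equal length")
    if window <= 0 or len(ref_aln) < window:
        return None
    flags = [1 if q == r and q != "-" else 0
             for q, r in zip(query_aln, ref_aln)]
    current = sum(flags[:window])  # matches in the first window
    best = current
    for i in range(window, len(flags)):
        # Slide one position: add the entering flag, drop the leaving one.
        current += flags[i] - flags[i - window]
        best = max(best, current)
    return 100.0 * best / window
```

For example, an alignment that is only about 67 percent identical overall can still contain a window that matches the reference perfectly, which is the situation this approach is intended to catch.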

In still other embodiments, the compute engine 106 may detect whether hazardous content is present in the nucleotide sequence by selecting a group of nucleotide sequence segments that each matched part of one of the reference sequences, but where a matching length between each selected nucleotide sequence segment and the corresponding partial reference sequence did not exceed a threshold length. In such a case, the compute engine 106 will combine the selected plurality of nucleotide sequence segments into a composite nucleotide sequence and apply a sliding window to the composite nucleotide sequence to analyze a matching percentage between each portion of the composite nucleotide sequence having the threshold length and the reference sequence. If the compute engine 106 determines that the matching percentage between any portion of the composite nucleotide sequence having the threshold length and the reference sequence exceeds the threshold percentage, the compute engine 106 will consider the composite nucleotide sequence to be a positive match to that reference sequence and will flag the nucleotide sequence as containing whatever hazardous content is represented by the reference sequence. This last approach catches cases where smaller segments of threat factors that would normally fail the threshold requirements are placed within a larger sequence. The compute engine 106 can apply this same approach to combine segments across all runs in a single job (not just a single run) as well.
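A rough sketch of the composite approach is given below. It assumes each short segment is available as a tuple of (position in the reference, aligned query text, aligned reference text) and that the composite is formed by ordering the segments by reference position and concatenating them; both the tuple layout and the ordering rule are assumptions, since the disclosure does not state how the segments are combined.

```python
def composite_is_match(segments, threshold_length, threshold_percent):
    """Combine short matching segments and test them with a sliding window.

    segments: iterable of (ref_start, query_aln, ref_aln) tuples, where the
    two strings are aligned and of equal length ("-" marks a gap).
    Returns True if any window of threshold_length positions in the
    composite meets threshold_percent.
    """
    flags = []
    # Assumption: segments are concatenated in reference order.
    for _, query_aln, ref_aln in sorted(segments, key=lambda s: s[0]):
        if len(query_aln) != len(ref_aln):
            raise ValueError("aligned strings must have equal length")
        flags.extend(1 if q == r and q != "-" else 0
                     for q, r in zip(query_aln, ref_aln))
    if threshold_length <= 0 or len(flags) < threshold_length:
        return False  # the composite is still too short to evaluate
    current = sum(flags[:threshold_length])
    best = current
    for i in range(threshold_length, len(flags)):
        current += flags[i] - flags[i - threshold_length]
        best = max(best, current)
    return 100.0 * best / threshold_length >= threshold_percent
```

In this sketch, two six-base segments that each fall below a ten-base threshold length can together form a twelve-position composite that does satisfy the window test.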

Referring now to FIG. 2, one illustrative embodiment of a method 200 of nucleotide sequence screening is shown as a simplified flow diagram. The method 200 is illustrated as a number of blocks 202-220. Although the blocks 202-220 are generally shown and described sequentially in the present disclosure, it will be appreciated that the blocks 202-220 do not necessarily need to be performed in a particular order (unless otherwise noted below). For instance, it is contemplated that many of the blocks 202-220 might be performed in parallel with other blocks during the method 200.

The method 200 begins with block 202 in which the server 104 receives a request from the frontend 102 to screen one or more nucleotide sequences for hazardous content. As discussed above, the server 104 may receive this request and the associated nucleotide sequence(s) via the initiation endpoint. Block 202 may involve the server 104 writing the nucleotide sequence(s) to be screened to the queue in the workflow database 108.

After block 202, the method 200 proceeds to block 204 in which the compute engine 106 compares each nucleotide sequence of the one or more nucleotide sequences to each of a plurality of reference sequences stored in reference database 110. Block 204 may begin with the compute engine 106 retrieving a nucleotide sequence to be screened from the queue maintained by the workflow database 108. During block 204, the compute engine 106 may use any suitable algorithm to compare the nucleotide sequence being run to the reference sequences of one or more of the reference databases 110. In the illustrative embodiment, block 204 involves the compute engine 106 using a basic local alignment search tool (BLAST) to compare the nucleotide sequence to each of the reference sequences.

After block 204, the method 200 proceeds to block 206 in which the compute engine 106 detects whether hazardous content is present in each nucleotide sequence based upon the comparison of that nucleotide sequence to each of the plurality of reference sequences performed in block 204. Block 206 involves determining whether each nucleotide sequence sufficiently matched one (or more) of the reference sequences including hazardous content. In this way, block 206 functions as a filter on the results of block 204 to find the matches that are true threats.

As discussed above, the detection of hazardous content in block 206 using the results of the comparisons from block 204 may take a number of forms in various embodiments. For instance, as represented in optional block 208, in some embodiments, the compute engine 106 detects whether hazardous content is present in the nucleotide sequence by selecting a reference sequence that provided the closest match to the nucleotide sequence during the comparison of the nucleotide sequence to each of the plurality of reference sequences. The compute engine 106 then determines whether a matching length between the selected “closest match” reference sequence and the nucleotide sequence exceeds a threshold length and whether a matching percentage between the selected “closest match” reference sequence and the nucleotide sequence exceeds a threshold percentage. If the minimum length and percentage matching criteria are met, the compute engine 106 considers the nucleotide sequence to be a positive match to that reference sequence and flags the nucleotide sequence as containing whatever hazardous content is represented by the reference sequence.

As represented in optional block 210, in other embodiments, the compute engine 106 detects whether hazardous content is present in the nucleotide sequence by identifying each reference sequence where a matching length between the reference sequence and the nucleotide sequence does not exceed a threshold length but a matching percentage between the reference sequence and the nucleotide sequence does exceed a threshold percentage. For such a case, the compute engine 106 will extend (or scale) the matching length up to the threshold length to determine whether the matching percentage between the extended reference sequence and the nucleotide sequence still exceeds the threshold percentage. If so, the compute engine 106 will consider the nucleotide sequence to be a positive match to that reference sequence and will flag the nucleotide sequence as containing whatever hazardous content is represented by the reference sequence.

As represented in optional block 212, in other embodiments, the compute engine 106 detects whether hazardous content is present in the nucleotide sequence by identifying each reference sequence where a matching length between the reference sequence and the nucleotide sequence does exceed a threshold length but a matching percentage between the reference sequence and the nucleotide sequence does not exceed a threshold percentage. For such a case, the compute engine 106 will apply a sliding window to analyze a matching percentage between the nucleotide sequence and each portion of the reference sequence having the threshold length. If the compute engine 106 determines that the matching percentage between the nucleotide sequence and any portion of the reference sequence having the threshold length exceeds the threshold percentage, it will consider the nucleotide sequence to be a positive match to that reference sequence and will flag the nucleotide sequence as containing whatever hazardous content is represented by the reference sequence.

As represented in optional block 214, in still other embodiments, the compute engine 106 may detect whether hazardous content is present in the nucleotide sequence by selecting a group of nucleotide sequence segments that each matched part of one of the reference sequences, but where a matching length between each selected nucleotide sequence segment and the corresponding partial reference sequence did not exceed a threshold length. In such a case, the compute engine 106 will combine the selected plurality of nucleotide sequence segments into a composite nucleotide sequence and apply a sliding window to the composite nucleotide sequence to analyze a matching percentage between each portion of the composite nucleotide sequence having the threshold length and the reference sequence. If the compute engine 106 determines that the matching percentage between any portion of the composite nucleotide sequence having the threshold length and the reference sequence exceeds the threshold percentage, the compute engine 106 will consider the composite nucleotide sequence to be a positive match to that reference sequence and will flag the nucleotide sequence as containing whatever hazardous content is represented by the reference sequence. This last approach catches cases where smaller segments of threat factors that would normally fail the threshold requirements are placed within a larger sequence. In alternative embodiments of block 214, the compute engine 106 can apply this same approach to combine segments across all runs in a single job (not just a single run).

After block 206 (including any of optional blocks 208-214), the method 200 may optionally proceed to block 216 in which the compute engine 106 retrieves corresponding metadata from the reference database 110 in response to detecting that hazardous content is present in one of the nucleotide sequences. For instance, the reference database(s) 110 may include numerous types of metadata associated with each reference sequence, where the metadata describes one or more characteristics of the hazardous content included in that reference sequence. By way of example, the metadata might include a description of a Virulence Factor (VF) for the hazardous content, the VF's function, and the like.

After block 216 (when used), or after block 214 (if block 216 is not used), the method 200 proceeds to block 218 in which the compute engine 106 assigns one of a plurality of threat levels to each nucleotide sequence based upon the detection of whether hazardous content is present in that nucleotide sequence. Depending on how closely (or not) the nucleotide sequence being screened was determined to match one or more of the reference sequences including hazardous content in block 206, the nucleotide sequence may be assigned any of a first level representing a threat, a second level representing a potential threat, a third level representing an unlikely threat, or a fourth level representing a non-threat in block 218.
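The four threat levels of block 218 can be represented as an enumeration, as in the sketch below. The mapping from match closeness to a level shown here, including the two cut-off parameters and their default values, is purely hypothetical: the disclosure states only that the level depends on how closely the screened sequence matched a hazardous reference sequence.

```python
from enum import Enum


class ThreatLevel(Enum):
    """The four illustrative threat levels named in the description."""
    THREAT = "threat"
    POTENTIAL_THREAT = "potential threat"
    UNLIKELY_THREAT = "unlikely threat"
    NON_THREAT = "non-threat"


def assign_threat_level(positive_match, best_percent,
                        potential_cutoff=80.0, unlikely_cutoff=50.0):
    """Map a detection outcome to a threat level.

    positive_match: True if block 206 flagged hazardous content.
    best_percent: best matching percentage against any hazardous reference,
        or None if nothing matched at all.
    The cut-off values are hypothetical placeholders, not values from the
    disclosure.
    """
    if positive_match:
        return ThreatLevel.THREAT
    if best_percent is None:
        return ThreatLevel.NON_THREAT
    if best_percent >= potential_cutoff:
        return ThreatLevel.POTENTIAL_THREAT
    if best_percent >= unlikely_cutoff:
        return ThreatLevel.UNLIKELY_THREAT
    return ThreatLevel.NON_THREAT
```

The enumeration values match the strings displayed by the GUI 300, so a frontend could use them directly when choosing the background color for each result.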

After block 218, the method 200 concludes with block 220 in which the compute engine 106 reports the result of screening the one or more nucleotide sequences for hazardous content to the frontend 102. In most embodiments, block 220 will involve the server 104 reporting the threat level assigned to each nucleotide sequence of the one or more nucleotide sequences to the frontend 102. Additionally, in some embodiments (where optional block 216 is utilized), block 220 may involve the server 104 reporting the corresponding metadata for each nucleotide sequence for which hazardous content is detected to the frontend 102. After receiving these results, the frontend 102 may display this information to a user using the GUI 300 (FIG. 3).

While certain illustrative embodiments have been described in detail in the figures and the foregoing description, such an illustration and description is to be considered as exemplary and not restrictive in character, it being understood that only illustrative embodiments have been shown and described and that all changes and modifications that come within the spirit of the disclosure are desired to be protected. There are a plurality of advantages of the present disclosure arising from the various features of the methods, systems, and articles described herein. It will be noted that alternative embodiments of the methods, systems, and articles of the present disclosure may not include all of the features described yet still benefit from at least some of the advantages of such features. Those of ordinary skill in the art may readily devise their own implementations of the methods, systems, and articles that incorporate one or more of the features of the present disclosure.

1. A system for assessing threat levels associated with synthesizingnucleotide sequences, the system comprising: a server to communicatewith a number of remote frontends over a network to receive, from thefrontends, a number of requests to screen one or more nucleotidesequences to be synthesized for hazardous content, wherein the requestscollectively include a plurality of nucleotide sequences comprising atleast one thousand nucleotide sequences; and a compute engine to:compare each nucleotide sequence of the plurality of nucleotidesequences to each of a plurality of reference sequences stored in areference database, detect whether hazardous content is present in eachnucleotide sequence of the plurality of nucleotide sequences based uponthe comparison of that nucleotide sequence to each of the plurality ofreference sequences, including performing a sliding window analysis on acomposite sequence formed from portions of nucleotide sequences in theplurality of nucleotide sequences, and assign one of a plurality ofthreat levels to each nucleotide sequence based upon the detection ofwhether hazardous content is present in that nucleotide sequence;wherein the server is further to report the threat level assigned tosynthesizing each nucleotide sequence to the respective front end thatrequested screening of that nucleotide sequence.
 2. The system of claim 1, wherein to detect whether hazardous content is present in each nucleotide sequence based upon the comparison of that nucleotide sequence to each of the plurality of reference sequences comprises to: select a reference sequence that provided a closest match to that nucleotide sequence during the comparison of that nucleotide sequence to each of the plurality of reference sequences, wherein the selected reference sequence includes hazardous content; and detect that hazardous content is present in that nucleotide sequence in response to determining that (i) a matching length between the selected reference sequence and that nucleotide sequence satisfies a threshold length and (ii) a matching percentage between the selected reference sequence and that nucleotide sequence satisfies a threshold percentage.
 3. The system of claim 1, wherein to detect whether hazardous content is present in each nucleotide sequence based upon the comparison of that nucleotide sequence to each of the plurality of reference sequences comprises to: for each reference sequence including hazardous content where (i) a matching length between the reference sequence and that nucleotide sequence does not satisfy a threshold length but (ii) a matching percentage between the reference sequence and that nucleotide sequence does satisfy a threshold percentage, extending the matching length up to the threshold length; and detect that hazardous content is present in that nucleotide sequence in response to determining that the matching percentage between the extended reference sequence and that nucleotide sequence still satisfies the threshold percentage.
 4. The system ofclaim 1, wherein to detect whether hazardous content is present in eachnucleotide sequence based upon the comparison of that nucleotidesequence to each of the plurality of reference sequences comprises to:for each reference sequence including hazardous content where (i) amatching length between the reference sequence and that nucleotidesequence satisfies a threshold length but (ii) a matching percentagebetween the reference sequence and that nucleotide sequence does notsatisfy a threshold percentage, apply a sliding window to analyze amatching percentage between that nucleotide sequence and portions of thereference sequence having the threshold length; and detect thathazardous content is present in that nucleotide sequence in response todetermining that the matching percentage between that nucleotidesequence and any portion of the reference sequence having the thresholdlength satisfies the threshold percentage.
 5. The system of claim 1,wherein to detect whether hazardous content is present in eachnucleotide sequence based upon the comparison of that nucleotidesequence to each of the plurality of reference sequences comprises todetect that hazardous content is present in the composite nucleotidesequence in response to determining that a matching percentage betweenany portion of the composite nucleotide sequence having a thresholdlength and the one of the plurality of reference sequences satisfies athreshold percentage.
 6. The system of claim 1, wherein each of theplurality of reference sequences includes hazardous content, wherein thereference database further comprises metadata associated with eachreference sequence that describes one or more characteristics of thehazardous content included in that reference sequence, and wherein thecompute engine is configured to retrieve, in response to detecting thathazardous content is present in one of the nucleotide sequences, thecorresponding metadata from the reference database, and wherein theserver is further to report the corresponding metadata for eachnucleotide sequence for which hazardous content is detected to therespective front end that requested screening of that nucleotidesequence.
 7. The system of claim 1, wherein the plurality of threat levels includes at least a first level representing a threat, a second level representing a potential threat, a third level representing an unlikely threat, and a fourth level representing a non-threat.
 8. The system of claim 1, wherein to compare each nucleotide sequence of the plurality of nucleotide sequences to each of the plurality of reference sequences comprises using an alignment algorithm to compare each nucleotide sequence to each of the plurality of reference sequences.
 9. The system of claim 1, wherein each frontend is to provide a graphical user interface to allow a user to input the one or more nucleotide sequences to be screened for hazardous content and to display to the user the threat level assigned to synthesizing each of the one or more nucleotide sequences input by the user.
 10. The system of claim 9, wherein the graphical user interface is configured to allow the user to input multiple nucleotide sequences to be screened by uploading a single file containing the multiple nucleotide sequences.
 11. The system of claim 9, wherein the graphical user interface is configured to display to the user a progress of the screening of the one or more nucleotide sequences for hazardous content, based upon asynchronous updates received from the server, until the threat level is received from the server.
 12. The system of claim 1, further comprising a workflowdatabase including a queue of nucleotide sequences to be screened forhazardous content, wherein the server is further to write each of theplurality of nucleotide sequences received from the frontends to thequeue, and wherein the compute engine is further to read one nucleotidesequence at a time from the queue in order to compare that nucleotidesequence to each of the plurality of reference sequences.
 13. A methodfor assessing threat levels associated with synthesizing one or morenucleotide sequences, the method comprising: comparing, with a computeengine, each nucleotide sequence of the one or more nucleotide sequencesto each of a plurality of reference sequences stored in a referencedatabase; detecting, with the compute engine, whether hazardous contentis present in each nucleotide sequence of the one or more nucleotidesequences based upon the comparison of that nucleotide sequence to eachof the plurality of reference sequences, including performing a slidingwindow analysis on a composite sequence formed from portions ofnucleotide sequences in the plurality of nucleotide sequences;assigning, with the compute engine, one of a plurality of threat levelsto synthesizing each nucleotide sequence based upon the detection ofwhether hazardous content is present in that nucleotide sequence;determining that a threat level assigned to a nucleotide sequence of theone or more nucleotide sequences does not represent a threat; andsynthesizing, after determining that the threat level does not representa threat, the nucleotide sequence of the one or more nucleotidesequences.
 14. The method of claim 13, wherein detecting whether hazardous content is present in each nucleotide sequence based upon the comparison of that nucleotide sequence to each of the plurality of reference sequences comprises: selecting a reference sequence that provided a closest match to that nucleotide sequence during the comparison of that nucleotide sequence to each of the plurality of reference sequences, wherein the selected reference sequence includes hazardous content; and detecting that hazardous content is present in that nucleotide sequence in response to determining that (i) a matching length between the selected reference sequence and that nucleotide sequence satisfies a threshold length and (ii) a matching percentage between the selected reference sequence and that nucleotide sequence satisfies a threshold percentage.
 15. The method of claim 13, wherein detecting whether hazardous content is present in each nucleotide sequence based upon the comparison of that nucleotide sequence to each of the plurality of reference sequences comprises: for each reference sequence including hazardous content where (i) a matching length between the reference sequence and that nucleotide sequence does not satisfy a threshold length but (ii) a matching percentage between the reference sequence and that nucleotide sequence does satisfy a threshold percentage, extending the matching length up to the threshold length; and detecting that hazardous content is present in that nucleotide sequence in response to determining that the matching percentage between the extended reference sequence and that nucleotide sequence still satisfies the threshold percentage.
 16. The method of claim 13, wherein detecting whether hazardous content is present in each nucleotide sequence based upon the comparison of that nucleotide sequence to each of the plurality of reference sequences comprises: for each reference sequence including hazardous content where (i) a matching length between the reference sequence and that nucleotide sequence satisfies a threshold length but (ii) a matching percentage between the reference sequence and that nucleotide sequence does not satisfy a threshold percentage, applying a sliding window to analyze a matching percentage between that nucleotide sequence and portions of the reference sequence having the threshold length; and detecting that hazardous content is present in that nucleotide sequence in response to determining that the matching percentage between that nucleotide sequence and any portion of the reference sequence having the threshold length satisfies the threshold percentage.
 17. The method of claim 13, wherein detecting whether hazardous content is present in each nucleotide sequence based upon the comparison of that nucleotide sequence to each of the plurality of reference sequences comprises detecting that hazardous content is present in the composite nucleotide sequence in response to determining that a matching percentage between any portion of the composite nucleotide sequence having a threshold length and the one of the plurality of reference sequences satisfies a threshold percentage.
 18. The method of claim 13, wherein each of the plurality of reference sequences includes hazardous content, wherein the reference database further comprises metadata associated with each reference sequence that describes one or more characteristics of the hazardous content included in that reference sequence, and wherein the method further comprises retrieving, with the compute engine, the corresponding metadata from the reference database in response to detecting that hazardous content is present in one of the nucleotide sequences.
 19. The method of claim 13, wherein comparing each nucleotide sequence of the one or more nucleotide sequences to each of the plurality of reference sequences comprises using an alignment algorithm to compare each nucleotide sequence to each of the plurality of reference sequences.
 20. A system for assessing threat levels associated with synthesizing nucleotide sequences, the system comprising: a server to communicate with a remote frontend over a network to receive, from the frontend, a request to screen a plurality of nucleotide sequences to be synthesized for hazardous content and to report, to the frontend, a result of screening the plurality of nucleotide sequences to be synthesized for hazardous content, wherein the frontend is to provide a graphical user interface to allow a user to input the plurality of nucleotide sequences to be screened for hazardous content by uploading a single file containing the plurality of nucleotide sequences and to display to the user the result of screening the plurality of nucleotide sequences for hazardous content; and a processor to: compare each nucleotide sequence of the plurality of nucleotide sequences to each of a plurality of reference sequences stored in a reference database, wherein the plurality of nucleotide sequences includes at least a thousand nucleotide sequences; detect whether hazardous content is present in each nucleotide sequence of the plurality of nucleotide sequences based upon the comparison of that nucleotide sequence to each of the plurality of reference sequences, including performing a sliding window analysis on a composite sequence formed from portions of nucleotide sequences in the plurality of nucleotide sequences; and assign one of a plurality of threat levels to synthesizing each nucleotide sequence based upon the detection of whether hazardous content is present in that nucleotide sequence; wherein the result reported to the frontend by the server comprises the threat level assigned to synthesizing each nucleotide sequence of the plurality of nucleotide sequences.
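By way of illustration only, the two-threshold test recited in claim 14 and the sliding window analysis recited in claim 16 may be sketched as follows. This is a minimal sketch and not the claimed implementation. It assumes the two sequences have already been aligned without gaps, so that position i of one corresponds to position i of the other. The function names, the toy strings, and the threshold values (a length of 8 and a percentage of 90) are hypothetical and do not come from the disclosure.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of positions at which two equal-length strings agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    if not a:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def meets_thresholds(query: str, reference: str,
                     min_length: int, min_percent: float) -> bool:
    """Two-threshold test: the matched length and the matching
    percentage must both satisfy their (hypothetical) thresholds."""
    return (len(query) >= min_length
            and percent_identity(query, reference) >= min_percent)


def sliding_window_hit(query: str, reference: str,
                       window: int, min_percent: float) -> bool:
    """Sliding window test: report a hit if any window of `window`
    positions satisfies the percentage threshold."""
    if len(query) != len(reference):
        raise ValueError("sequences must have equal length")
    for start in range(len(query) - window + 1):
        end = start + window
        if percent_identity(query[start:end], reference[start:end]) >= min_percent:
            return True
    return False


# Toy strings only; 9 of the 12 positions agree (75 percent overall).
reference = "ACGTACGTACGT"
query = "TTTTACGTACGT"

print(meets_thresholds(query, reference, 8, 90.0))    # False
print(sliding_window_hit(query, reference, 8, 90.0))  # True
```

In this toy case the full-length comparison falls below the percentage threshold, while the window covering the last eight positions matches exactly, which is the situation the sliding window analysis is meant to catch. A practical system would obtain the matched regions from an alignment algorithm, as recited in claim 19, rather than assume a gap-free alignment.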